-
-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalidate gem name using levenshtein distance for gems with ten million downloads #2037
Conversation
44bc515
to
827a2e8
Compare
There was already some discussion before where I have pointed to some actually confusing limitations, but I can't find it now. Anyone any idea where it was discussed? From attached list I still see the same problem. For example:
Is the original purpose for this to make rubygems.org safer? I think there can be a better strategy to achieve that. I can try to elaborate as well if welcomed. |
This PR only blocks new gems where gem name is four or more characters. levenshtein distance threshold for four-letter gems is one and that of five or more is two. I have prepared the list of existing gem with invalid names so that we can decide if we want to block new releases on them. IMHO, we should block new releases too with an exception of these 1505 gems.
I understand deciding what constitutes as a wrong type is not easy and rules used in this PR won't protect against all typos. However, it seems good enough start and shouldn't have much impact on users.
Sure. Any suggestion on better ways of dealing with problems like #2028 is welcome. |
app/models/gem_typo.rb
Outdated
end | ||
|
||
def protected_gems | ||
Rubygem.by_downloads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about to use something similar to:
Rubygem.joins(:gem_download)
.where("gem_downloads.count > ?", GemTypo::DOWNLOADS_THRESHOLD)
.where.not(name: @rubygem_name)
.pluck(:name)
- it doesn't do unnecessary order done by
by_downloads
scope - it filters out
@rubygem_name
on database level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks 👍 Updated. Technically, this PR doesn't need this check but it will be needed when we remove :create
on validation.
private | ||
|
||
def distance_threshold | ||
@rubygem_name.size == GemTypo::SIZE_THRESHOLD ? 1 : 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about to "memoize" this to not execute same call on every iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it to the initializer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a really good place to start and adjust from there 👍
47dc772
to
e0699a4
Compare
Use downloads count as threshold instead of static list. Limit validation to new gems. Update distance threshold as mentioned in segiddins PR. Existing gem we may consider blocking: https://gist.github.com/sonalkr132/af05b030af793ce17a69245152d5aa5f total: 4859 (2.98%)
e0699a4
to
5cd2749
Compare
@simi any issue with merging this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks ok to me. 👍
Using the number of downloads as a threshold (as opposed to top 1000 gems) would keep blocked list consistent. 4859 (3% of total) of existing gems have an invalid name, 1505 of which seem legit gems. We can use postgres to store the exclusion list and use a script to update it. I can create a follow-up PR if we should blocking release to existing invalid gems.
The current protected list would be of 683 gems. A larger protected list (say 1000 gems, 5 mil downloads) would mean a more significant portion of gem names would be blocked (4.5% invalid names of total).